1 Introduction

Housing prices are an important indicator of the strength of the economy. House price prediction can help real estate developers determine the selling price of a house, allow buyers to make informed choices about potential purchases, and help property investors determine price trends across different locations. Hence, a simple predictive and inferential method for modeling housing prices can be of great significance to the financial market; however, predicting long-term housing prices has become a complex and challenging task. This paper discusses our project on determining how different factors may affect home sale prices by building linear models. The data used in this project were collected in Melbourne, Australia in 2017. Melbourne is a large metropolitan city with a strong real estate market in a region of Australia that experienced a 4.2 percent growth rate in property sales in 2017. We believe the factors that determine housing prices in our model could have broad applications to other locations and countries.

Our project had the following objectives:

  • Understand if housing prices in Melbourne, Australia can be predicted using this dataset.
  • Determine what variables have the greatest impact on housing price.
  • Analyze the impacts of location, seller, and construction attributes of homes on the housing market in Melbourne, Australia.

1.1 The Melbourne Housing Snapshot Dataset

The independent variables describe each house along three dimensions: (a) its type; (b) its quality or grade; and (c) its quantity or area. Ahead of the exploratory data analysis, the Melbourne housing dataset and its variables are summarized as follows:

  • Home Sales in 2017
    • Location
    • Construction
    • Sale
  • Variables: 21
    • Numeric: 12
    • Categorical: 9

Rooms: Number of rooms

Price: Sale price (AUD)

Method: Method of sale - 5 categories

Type: House, Unit, Townhouse - 3 categories

SellerG: Real Estate Agent - 268 categories

Date: Date sold

Distance: Distance from Central Business District

Regionname: Region name - 8 categories

Propertycount: Number of properties that exist in the suburb

Bedroom2: Number of Bedrooms

Bathroom: Number of Bathrooms

Car: Number of car spots

Landsize: Land Size

BuildingArea: Building Size

YearBuilt: Year home built

CouncilArea: Governing council for the area - 34 categories

Lattitude, Longtitude: GPS location

Suburb: Suburb name - 314 categories

2 Exploratory Data Analysis (EDA)

The following are excerpts and graphs from our exploratory data analysis. This part of the project familiarizes the reader with the dataset's attributes and lays the foundation for choosing the variables in our linear model. The results of our EDA also inform the future direction of the project.

2.1 Summary of Price Statistics

Summary of data.full$Price (AUD)

Statistic   Value
Min         85000
Q1          650000
Median      903000
Mean        1075684
Q3          1330000
Max         9000000
SD          639310.724
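As a rough illustration of how such a summary can be computed, the sketch below uses only the Python standard library. The function name `summarize` and the five example prices are hypothetical and are not taken from the dataset; the quartile rule is assumed to be the interpolating one that R uses by default.

```python
import statistics


def summarize(values):
    """Return min, quartiles, median, mean, max, and sample SD of a numeric list."""
    data = sorted(values)
    # method="inclusive" interpolates between order statistics, the same
    # rule as R's default quantile type (type 7).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "Min": data[0],
        "Q1": q1,
        "Median": median,
        "Mean": statistics.mean(data),
        "Q3": q3,
        "Max": data[-1],
        "SD": statistics.stdev(data),  # sample standard deviation (n - 1)
    }


# Hypothetical prices in AUD, for illustration only.
prices = [85000, 650000, 903000, 1330000, 9000000]
summary = summarize(prices)
for name, value in summary.items():
    print(f"{name:>6}: {value:,.2f}")
```

A mean well above the median, as in the table above, is a first hint of right skew.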

2.2 Select Data Pairs

In these scatter plots of selected numerical variables, we can see a couple of potential outliers in home size, land size, and selling price. These make it somewhat difficult to discern patterns in some of the pairings. Nonetheless, there does appear to be a hint of linear correlation, which we will explore further.

2.3 Correlations

Because the dataset has a large number of feature columns, it is difficult to grasp all the pairwise linear correlations at once. Therefore, before further feature selection, we examine which variables are highly correlated with house price. The heatmap below is interactive when viewed in HTML: hover over a specific coordinate to view the correlation coefficient and the p-value for that variable pair.

The general direction of most of these correlations should be unsurprising. One would expect the number of rooms to be positively correlated with both the number of bedrooms and price. Conversely, distance from the central business district is slightly negatively correlated with price. There are also strong correlations between some predictors, such as Rooms and Bedroom2, Rooms and Bathroom, and Bedroom2 and Bathroom, which raise a concern about multicollinearity. This guides our feature selection for linear regression and is borne out in the significance and variance inflation of the attempted models.
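The coefficients behind such a heatmap are ordinary Pearson correlations. A minimal sketch of the calculation follows; the column values are invented for illustration and the helper names `pearson_r` and `correlation_matrix` are hypothetical, not part of the original analysis.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Map every ordered pair of column names to its correlation."""
    names = list(columns)
    return {(a, b): pearson_r(columns[a], columns[b]) for a in names for b in names}


# Hypothetical values, for illustration only.
columns = {
    "Rooms": [2, 3, 3, 4, 5],
    "Bedroom2": [2, 3, 3, 4, 4],
    "Distance": [12.0, 9.5, 8.0, 6.0, 2.5],
    "Price": [600000, 850000, 900000, 1200000, 1500000],
}
matrix = correlation_matrix(columns)
for (a, b), r in matrix.items():
    if a < b:  # print each unordered pair once
        print(f"{a:>9} vs {b:<9} r = {r:+.3f}")
```

Pairs of predictors with a coefficient close to 1, such as Rooms and Bedroom2 in this toy data, are the ones to watch for multicollinearity.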

2.4 Map of Melbourne Sales

We also examined the geographic distribution of housing sales in Melbourne visually. The result is shown in the figure below.

The figure shows that sales are concentrated mainly in the Eastern Metropolitan, Southern Metropolitan, Northern Metropolitan, Western Metropolitan, and South-Eastern Metropolitan regions. Fluctuations in housing prices will therefore affect these regions most, and together they account for about 5/6 of Melbourne.

2.5 Selling Price

Selling price is the variable that we would like to predict and draw inferences about through linear modeling, so we explore its distribution further.

By histogram, box-plot, and qq-plot, selling price appears right-skewed rather than normal. This makes sense: no sale was below $85,000 and there is a theoretical hard stop on the left at $0, whereas housing prices can be quite high with no theoretical upper limit. That pattern is clearly displayed here.

The log-transformed price is approximately normally distributed; the histogram, box-plot, and qq-plot all show only very slight deviation from normal. Hence, selling price is approximately log-normal.
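One simple numerical companion to these plots is the sample skewness before and after taking logs. The sketch below is only illustrative: the prices are hypothetical, and `skewness` is a helper written for this example using the moment-based definition.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed prices in AUD, for illustration only.
prices = [300000, 450000, 600000, 800000, 1100000,
          1500000, 2400000, 4000000, 9000000]
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(f"skewness of Price:      {raw_skew:.2f}")
print(f"skewness of log(Price): {log_skew:.2f}")
```

A skewness much closer to zero after the transformation is consistent with an approximately log-normal price.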

2.5.1 Selling Price’s Relationship to Select Categorical Variables

We suspect that location, seller, and type of home are related to sale price. Furthermore, by the above heat-map, price is correlated with the number of rooms, which can be treated as categorical; the more rooms there are, the higher the price.

2.5.2 Price by Region

By box-plot, there does appear to be a dependence between region and price. Note that the data still appear non-normal and plausibly log-normal. Although the number of observations in each level of region is sufficiently large, the level sizes vary greatly: Western Victoria has only 32 sales, whereas Southern Metropolitan has 4695. Since normality is not satisfied and the level sizes are uneven, we cannot rely on the robustness of ANOVA to test for independence.

2.5.3 Price by Number of Rooms (<10 Rooms)

Levels with more than 9 rooms are omitted because they contain very few observations. Again, the data appear plausibly log-normal. The box-plot suggests a dependence, especially for homes with fewer than six rooms; it is less pronounced as the number of rooms increases. Again, the level sizes are highly uneven; hence, we cannot apply ANOVA to test for dependence in this case either.

2.5.4 Price by Type of Home

The pattern of non-normal data and uneven level sizes repeats when price is conditioned on type of home (house, unit, townhouse). So, again, ANOVA is not appropriate.

2.5.5 Test of Independence by Group (Pearson \(\chi^2\))

Since ANOVA testing for independence is inadvisable given the distribution of selling price, Pearson \(\chi^2\) testing is used instead. For this, price must be categorical, so it is split into five groups of unequal size. The split is chosen so that the highest price group has enough observations to satisfy the Pearson assumptions but does not include a large number of average-priced homes. The variables being tested are seller (SellerG), number of rooms (Rooms), region (Regionname), and type of home (Type).

\(H_0\): Price group is independent of the categorical variable

All four tests reject \(H_0\) with p-value \(<2\times 10^{-16}\)

For each of the four categorical variables, the null hypothesis is that homes are distributed across the price groups in the same way at every level of the variable, that is, that price group and the variable are independent. As expected, we reject the null at \(\alpha=0.05\) in every case. We conclude that price group depends on each of the four variables. This is expected from both basic knowledge of housing markets and the box-plots above.
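For readers unfamiliar with the Pearson \(\chi^2\) statistic, it compares the observed counts in a contingency table (for example, home type by price group) with the counts expected if the two classifications were independent. The sketch below is a minimal hand-rolled version; the function name and the counts are hypothetical, and the p-value lookup against the \(\chi^2\) distribution is left out.

```python
def chi_square_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom.

    `table` is a list of rows of observed counts, e.g. one row per
    home type and one column per price group.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts: rows = two home types, columns = three price groups.
counts = [
    [10, 20, 30],
    [30, 20, 10],
]
stat, dof = chi_square_statistic(counts)
print(f"chi-squared = {stat:.2f} on {dof} degrees of freedom")
```

A large statistic relative to its degrees of freedom is what produces the very small p-values reported above.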

2.6 Price by Region and Type

The picture below shows the price by region and type of home. Many of the same patterns hold.

3 Linear Modelling

To build an ordinary least squares (OLS) model, 70% of the dataset is randomly selected for training and the remaining 30% is held out for final testing. Once the model has been evaluated on the test set, it is no longer altered.
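A random 70/30 split of this kind can be sketched as follows. The helper name, the fixed seed, and the stand-in rows are assumptions made for reproducibility of the example, not details from the original analysis.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of `rows` and split it into training and test parts."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the dataset rows: 100 hypothetical row identifiers.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Fixing the seed matters here because the test set is meant to be used only once, after all modeling decisions have been made on the training rows.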

3.1 First Attempt at Linear Model

In this first model, we set the regression model as: Price ~ Rooms + Landsize + Distance + Bedroom2 + Bathroom + Car + BuildingArea + Lattitude + Longtitude + Propertycount + factor (Regionname). To check whether the first model suffers from multicollinearity, we compute variance inflation factors (VIF) and follow the convention of considering removal of variables with VIF\(>5\). From the plot below, we see that Bedroom2 has a high VIF and is therefore not suitable for the house price regression. Thus, the second OLS regression model omits the Bedroom2 variable.
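For reference, the standard definition of the variance inflation factor for predictor \(j\) is given below, where \(R_j^2\) (a symbol introduced here, not in the original) is the coefficient of determination obtained by regressing predictor \(j\) on all the other predictors:

\[ \mathrm{VIF}_j = \dfrac{1}{1 - R_j^2} \]

Under this definition the cutoff VIF\(>5\) corresponds to \(R_j^2 > 0.8\), that is, more than 80% of the variation in predictor \(j\) is explained by the remaining predictors. This is consistent with the strong correlation between Rooms and Bedroom2 seen in the EDA.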

3.2 Linear Model 2: Removed the Variable with Highest VIF

In this model, we set the regression model as: Price ~ Rooms + Landsize + Distance + Bathroom + Car + BuildingArea + Lattitude + Longtitude + Propertycount + factor (Regionname). As with Model 1, the VIF results are shown below. As expected, the plot is essentially the same as before, but note that a single level of region (Regionname) exceeds the VIF cutoff. We choose to keep Regionname because all other levels are below the cutoff and the offending level does not exceed 5 by much.

3.3 Model Coefficients

In this model, the Western Metropolitan level of factor (Regionname) has the highest VIF value. The p-values for the model coefficients are shown below. Although it may be unusual to plot p-values for coefficients, this is done to quickly identify variables that may require additional investigation.

Landsize has the highest p-value, which is larger than our chosen \(\alpha=0.05\). Thus, this variable is dropped. Again, we choose to retain Regionname, as most of its levels are significant at this \(\alpha\).

3.4 Linear Model 3: Considered Interactions

In this model, we analyze the interaction of rooms and region. The result is shown as follows:

The p-value for the model suggests that the interaction is not significant, so the interaction is discarded.

3.5 Linear Model 4: Removed Land Size and without Interaction

Following the above model analysis, both the Landsize variable and the Rooms-Regionname interaction are dropped. The model is set as: Price ~ Rooms + Distance + Bathroom + Car + BuildingArea + Lattitude + Longtitude + Propertycount + factor (Regionname).

Again, we see a few p-values that exceed our \(\alpha\), but these belong to a few levels of Regionname, which we have decided to keep. It is also important to note that \(R^2_{adj}\) was generally stable at around 0.55 across all of the above models.

4 Residual Analysis

4.1 Homogeneity? No

There does appear to be a slight yet notable cone shape in the plot of residuals versus fitted values, suggesting some heterogeneity of variance. A transformation may be considered.

4.2 Normal? Not Quite

By qq-plot, the standardized residuals show some deviation from normality. Again, a transformation or alternative modeling may be considered.

4.3 Influence? Yes

There does appear to be one high influence point with leverage nearing 1 and an outlier of more than 10 standard deviations.
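Leverage and residual size are often combined into a single influence measure. One common choice, which the original analysis does not state it used, is Cook's distance. With \(r_i\) the standardized residual of observation \(i\), \(h_{ii}\) its leverage, and \(p\) the number of estimated coefficients (all symbols introduced here for illustration):

\[ D_i = \dfrac{r_i^2}{p} \cdot \dfrac{h_{ii}}{1 - h_{ii}} \]

Because the factor \(h_{ii}/(1-h_{ii})\) grows without bound as \(h_{ii}\) approaches 1, a point with leverage near 1 can dominate the fit even when its residual is modest.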

4.4 Remove Influence Points

Removal of the influence point was considered, but it does not greatly alter the coefficients, their significance, or \(R^2_{adj}\). Since the resulting model is not very different, and the reason for the high influence is not sufficiently understood, no observations are removed from the final model.

4.5 Transform Data - Homogeneity

This is a residual analysis on the transformed log-Price linear model: log(Price) ~ Rooms + Distance + Bathroom + Car + BuildingArea + Lattitude + Longtitude + Propertycount + factor (Regionname).

We do see a reduction of the cone shape, and no discernible pattern now appears.

4.6 Transform Data - Normality

The qq-plot shows some mild improvement using the transformed data. Relying on the robustness of OLS regression, we do not adopt the log-transformed model, since it would also lose interpretability for inferential purposes.
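To make the interpretability trade-off concrete, consider a generic log-Price model with coefficients \(\beta_j\) and predictors \(x_j\) (symbols introduced here for illustration). Raising predictor \(x_j\) by one unit, with all else held fixed, gives:

\[ \log(\widehat{Price}) = \beta_0 + \sum_j \beta_j x_j \quad\Longrightarrow\quad \dfrac{\widehat{Price}(x_j + 1)}{\widehat{Price}(x_j)} = e^{\beta_j} \]

So each coefficient describes a multiplicative change of roughly \(100\,(e^{\beta_j} - 1)\) percent in predicted price, whereas in the untransformed model \(\beta_j\) is a fixed change in AUD per unit of \(x_j\).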

5 Proposed Model

This is the final proposed model and details of related coefficients are shown as follows:

Observations         4955 (4548 missing obs. deleted)
Dependent variable   Price
Type                 OLS linear regression

F(15,4939)           425.77
R²                   0.56
Adj. R²              0.56

Term                                          Est.            S.E.          t val.   p
(Intercept)                                   -129445309.64   17287919.00   -7.49    0.00
Rooms                                         255713.04       9273.23       27.58    0.00
Distance                                      -44501.15       1467.41       -30.33   0.00
Bathroom                                      115532.73       11453.40      10.09    0.00
Car                                           45801.33        7532.08       6.08     0.00
BuildingArea                                  1794.92         90.67         19.80    0.00
Lattitude                                     -757249.47      124821.36     -6.07    0.00
Longtitude                                    696894.60       116542.04     5.98     0.00
Propertycount                                 -3.69           1.52          -2.42    0.02
factor(Regionname)Eastern Victoria            188304.28       103229.34     1.82     0.07
factor(Regionname)Northern Metropolitan       -55889.99       30219.79      -1.85    0.06
factor(Regionname)Northern Victoria           598550.10       116506.82     5.14     0.00
factor(Regionname)South-Eastern Metropolitan  169831.25       51506.18      3.30     0.00
factor(Regionname)Southern Metropolitan       212777.72       27309.69      7.79     0.00
factor(Regionname)Western Metropolitan        -86943.29       38731.01      -2.24    0.02
factor(Regionname)Western Victoria            515064.40       135131.01     3.81     0.00

Standard errors: OLS

Although the residual analysis is somewhat concerning, we rely on the robustness of OLS regression to recommend this model. No influence points are omitted. There are, however, a large number of missing values for many variables.
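To show how the coefficients in this section are read, the sketch below evaluates the fitted equation for a hypothetical home and then for the same home with one more room. The coefficients are copied from the table in this section. The home's attribute values are invented, and Eastern Metropolitan is assumed to be the baseline region because it is the only region without a coefficient of its own.

```python
# Coefficients copied from the coefficient table in this section (AUD).
COEF = {
    "(Intercept)": -129445309.64,
    "Rooms": 255713.04,
    "Distance": -44501.15,
    "Bathroom": 115532.73,
    "Car": 45801.33,
    "BuildingArea": 1794.92,
    "Lattitude": -757249.47,
    "Longtitude": 696894.60,
    "Propertycount": -3.69,
}
REGION_EFFECT = {
    "Eastern Metropolitan": 0.0,  # assumed baseline level (no coefficient listed)
    "Eastern Victoria": 188304.28,
    "Northern Metropolitan": -55889.99,
    "Northern Victoria": 598550.10,
    "South-Eastern Metropolitan": 169831.25,
    "Southern Metropolitan": 212777.72,
    "Western Metropolitan": -86943.29,
    "Western Victoria": 515064.40,
}


def predict_price(home):
    """Evaluate the fitted linear equation for one home (a dict of attributes)."""
    total = COEF["(Intercept)"] + REGION_EFFECT[home["Regionname"]]
    for name, beta in COEF.items():
        if name != "(Intercept)":
            total += beta * home[name]
    return total


# Hypothetical home, for illustration only.
home = {
    "Rooms": 3, "Distance": 10.0, "Bathroom": 1, "Car": 1,
    "BuildingArea": 120.0, "Lattitude": -37.80, "Longtitude": 145.00,
    "Propertycount": 5000, "Regionname": "Southern Metropolitan",
}
bigger = dict(home, Rooms=home["Rooms"] + 1)
extra_room_effect = predict_price(bigger) - predict_price(home)
print(f"Predicted change for one extra room: {extra_room_effect:,.2f} AUD")
```

The difference equals the Rooms coefficient regardless of the other attribute values, which is the additive interpretability that motivates keeping the untransformed model.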

5.1 Testing \(R^2\)

An alternative \(R^2\) is used to validate the model. This is done on the testing set:

\[ R^2 = 1 - \dfrac{RSS}{TSS} = 0.441 \]

This is lower than, but roughly in line with, what we would expect, since the final model has \(R^2_{adj}=0.56\). Note that the large number of observations with missing data are simply omitted from this value.
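The test-set \(R^2\) above can be computed directly from the held-out prices and the model's predictions for them. In the sketch below, the function name and the toy numbers are hypothetical, and TSS is assumed to be taken around the mean of the test-set prices.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - RSS/TSS for paired actual and predicted values."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: squared prediction errors.
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: squared deviations from the mean of the actual values.
    tss = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - rss / tss


# Toy numbers, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
score = r_squared(actual, predicted)
print(f"R^2 = {score:.3f}")
```

Unlike the in-sample value, this quantity can be negative when the model predicts worse than the test-set mean, so it is a stricter check of predictive performance.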

6 Conclusion

The analysis above did not treat outliers, and doing so might improve the results. Through the analysis of this dataset we practiced linear regression, and the final model is interpretable even though it does not excel at prediction. To improve prediction, alternative modeling methods must be considered. A more granular analysis of price and its dependencies would also be informative. Furthermore, the large number of missing values hinders model building.

6.1 Future Work

  • Further explore log transformation
  • Consider GLM with log link
  • Categorical variables with many levels
  • Missing data
  • Improve Prediction

7 Bibliography

Dataset available: https://www.kaggle.com/dansbecker/melbourne-housing-snapshot

Thorne, S. (2019, November 3). How the Australian Property Market Performed in 2017. Retrieved from www.openagent.com.au/blog/how-the-australian-property-market-performed-in-2017#.

Mansfield, E. R., & Helms, B. P. (1982). Detecting multicollinearity. The American Statistician, 36(3a), 158-160.

Daoud, J. I. (2017, December). Multicollinearity and regression analysis. In Journal of Physics: Conference Series (Vol. 949, No. 1, p. 012009). IOP Publishing.

Brownie, C., & Boos, D. D. (1994). Type I Error Robustness of ANOVA and ANOVA on Ranks When the Number of Treatments Is Large. Biometrics, 50(2), 542.

Lix, L. M., et al. (1996). Consequences of Assumption Violations Revisited: A Quantitative Review of Alternatives to the One-Way Analysis of Variance ‘F’ Test. Review of Educational Research, 66(4), 579.